SwePub

Result list for the search "db:Swepub ;pers:(Lu Zhonghai);srt2:(2020-2024)"

Search: db:Swepub > Lu Zhonghai > (2020-2024)

  • Results 1-10 of 51
1.
  • Chen, Hui, et al. (author)
  • A CORDIC-Based Architecture with Adjustable Precision and Flexible Scalability to Implement Sigmoid and Tanh Functions
  • 2020
  • In: IEEE International Symposium on Circuits and Systems, ISCAS 2020. - : IEEE.
  • Conference paper (peer-reviewed), abstract:
    • In artificial neural networks, tanh (hyperbolic tangent) and sigmoid are widely used as activation functions. Previous methods for computing them can suffer from low precision or from inflexible architectures that are difficult to extend, so we propose a CORDIC-based architecture to implement the sigmoid and tanh functions with adjustable precision and flexible scalability. It needs only shift-add-or-subtract operations to compute high-accuracy results, and its input range is easy to extend by scaling the negative iterations of CORDIC without changing the original architecture. We adopt the control-variable method to explore the accuracy distribution through software simulation. A specific case (ARCH:(1, 15, 18), RMSE: 10⁻⁶) is designed and synthesized under TSMC 40 nm CMOS technology; the report shows that it has an area of 36512.78 μm² and a power of 12.35 mW at a frequency of 1 GHz. The maximum working frequency can reach 1.5 GHz, which is better than state-of-the-art methods.
  •  
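Entry 1 evaluates tanh and sigmoid with only shift-add-or-subtract CORDIC iterations. A minimal Python sketch of the underlying recurrence follows: rotation-mode hyperbolic CORDIC produces sinh and cosh, tanh is their quotient, and sigmoid follows from sigmoid(x) = (1 + tanh(x/2))/2. The 16-iteration schedule, the repeated indices 4 and 13, and the final scale correction are textbook CORDIC details, not the paper's ARCH:(1, 15, 18) datapath or its negative-iteration range extension.

```python
import math

def hyperbolic_cordic(theta, n_iter=16):
    """Rotation-mode hyperbolic CORDIC: returns (cosh(theta), sinh(theta)).

    The loop uses only shifts, adds and subtracts; the constant scale
    factor is divided out at the end.  Iterations 4 and 13 are repeated,
    as required for convergence, which limits |theta| to about 1.118
    (no negative-index iterations in this sketch)."""
    schedule, i = [], 1
    while len(schedule) < n_iter:
        schedule.append(i)
        if i in (4, 13):             # repeat these indices once
            schedule.append(i)
        i += 1
    schedule = schedule[:n_iter]

    x, y, z = 1.0, 0.0, theta
    scale = 1.0
    for i in schedule:
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atanh(2.0 ** -i)
        scale *= math.sqrt(1.0 - 4.0 ** -i)
    return x / scale, y / scale       # cosh(theta), sinh(theta)

def tanh_cordic(x, n_iter=16):
    c, s = hyperbolic_cordic(x, n_iter)
    return s / c

def sigmoid_cordic(x, n_iter=16):
    # sigmoid(x) = (1 + tanh(x / 2)) / 2
    return 0.5 * (1.0 + tanh_cordic(x / 2.0, n_iter))

if __name__ == "__main__":
    for v in (-1.0, -0.5, 0.5, 1.0):
        print(v, round(tanh_cordic(v), 6), round(math.tanh(v), 6))
```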
2.
  • Chen, Hui, et al. (author)
  • A General Methodology and Architecture for Arbitrary Complex Number Nth Root Computation
  • 2021
  • In: 2021 IEEE International Symposium on Circuits and Systems (ISCAS). - : Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed), abstract:
    • As existing methods for computing the Nth root of a complex number are relatively scattered, we propose, for the first time, a general method and architecture based on the coordinate rotation digital computer (CORDIC) to compute the Nth root of an arbitrary complex number. Our method performs the tasks of computing the complex modulus, the complex phase angle, the real Nth root, and the sine and cosine functions, which can be implemented by circular CORDIC, linear CORDIC and hyperbolic CORDIC. Based on these CORDICs, our proposed architecture not only improves hardware efficiency by using only shift-add operations, but also flexibly adjusts the precision and the input range of the complex Nth root. To prove its feasibility, we conduct a software simulation and implement an example circuit in hardware. Synthesized under TSMC 28 nm CMOS technology, it has an area of 6561 μm² and a power of 3.95 mW at a frequency of 1.5 GHz.
  •  
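Entry 2 maps the complex Nth root onto five sub-tasks: complex modulus, phase angle, real Nth root, sine and cosine. The sketch below reproduces that decomposition for the principal root in plain floating point; the standard-library calls stand in for the circular, linear and hyperbolic CORDIC blocks named in the abstract, so it illustrates the mathematical flow rather than the shift-add hardware.

```python
import cmath
import math

def principal_nth_root(z: complex, n: int) -> complex:
    """Principal Nth root of z via the decomposition in entry 2:
    modulus and phase (vectoring-mode circular CORDIC in hardware),
    real Nth root (linear/hyperbolic CORDIC), then sine and cosine
    (rotation-mode circular CORDIC).  Library calls replace the
    CORDIC blocks in this sketch."""
    r = abs(z)                      # complex modulus
    phi = cmath.phase(z)            # complex phase angle
    root_r = r ** (1.0 / n)         # real Nth root
    angle = phi / n
    return complex(root_r * math.cos(angle), root_r * math.sin(angle))

if __name__ == "__main__":
    w = principal_nth_root(complex(-8.0, 0.0), 3)
    print(w, w ** 3)                # w**3 reproduces -8 up to rounding
```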
3.
  • Chen, Hui, et al. (author)
  • An Efficient Hardware Architecture with Adjustable Precision and Extensible Range to Implement Sigmoid and Tanh Functions
  • 2020
  • In: Electronics. - : MDPI. - 2079-9292. ; 9:10
  • Journal article (peer-reviewed), abstract:
    • The efficient and precise hardware implementation of the tanh and sigmoid functions plays an important role in various neural network algorithms. Different applications have different accuracy requirements, yet it is difficult for traditional methods to achieve adjustable precision. Therefore, we propose, for the first time, a hardware-efficient, adjustable-precision and high-speed architecture to implement them. We present two methods to implement the sigmoid and tanh functions: one is based on the rotation mode of hyperbolic CORDIC and the vector mode of linear CORDIC (called RHC-VLC), and the other is based on the carry-save method and the vector mode of linear CORDIC (called CSM-VLC). We validate the two methods with MATLAB and RTL implementations. Synthesized under TSMC 40 nm CMOS technology, a special case AR/VR(3,0) based on the RHC-VLC method has an area of 4290.98 μm² and a power of 1.69 mW at a frequency of 1.5 GHz, while AR/VC(3), a special case based on the CSM-VLC method, occupies 3196.36 μm² and consumes 1.38 mW at the same frequency. Both are superior to existing methods for implementing such an architecture with adjustable precision.
  •  
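Both methods in entry 3 finish with the vector mode of linear CORDIC, which performs the final division (for example sinh/cosh) with shift-and-add steps. A minimal sketch of that division step is given below; the iteration count and the floating-point arithmetic are illustrative, not the fixed-point RHC-VLC/CSM-VLC datapaths of the paper.

```python
def linear_cordic_divide(y, x, n_iter=24):
    """Vector mode of linear CORDIC: drives y toward zero with
    shift-and-add steps and accumulates the quotient y/x in z.
    Converges for |y/x| < 2 with the indices used here (i = 0, 1, ...)."""
    assert x != 0.0
    z = 0.0
    for i in range(n_iter):
        step = 2.0 ** -i
        d = 1.0 if (y >= 0) == (x > 0) else -1.0   # pick the sign that shrinks |y|
        y -= d * x * step
        z += d * step
    return z

def tanh_from_sinh_cosh(sinh_x, cosh_x):
    # tanh(x) = sinh(x) / cosh(x); the quotient magnitude is below 1,
    # well inside the convergence range of the division above.
    return linear_cordic_divide(sinh_x, cosh_x)

if __name__ == "__main__":
    import math
    x = 0.8
    print(tanh_from_sinh_cosh(math.sinh(x), math.cosh(x)), math.tanh(x))
```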
4.
  • Chen, H., et al. (author)
  • Huicore : A Generalized Hardware Accelerator for Complicated Functions
  • 2022
  • In: IEEE Transactions on Circuits and Systems I: Regular Papers. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-8328 .- 1558-0806. ; 69:6, pp. 2463-2476
  • Journal article (peer-reviewed), abstract:
    • Emerging advanced system-on-chip (SoC) designs contain more and more complicated functions to be accelerated. This presents a challenge to conventional design approaches, which use different hardware architectures or separate hardware accelerators to implement the various functions. To tackle this challenge, we propose, for the first time, a generalized hardware accelerator called 'Huicore' to speed up diverse functions on the same substrate. Through the analysis and transformation of their mathematical characteristics, we reveal the commonality of many complicated functions under the CORDIC algorithm, and we then explore a reconfigurable architecture to implement them. The proposed reconfigurable accelerator not only accelerates the implementation of many complicated functions, but also has a small area, low power consumption and high precision. It is well suited for integration in an SoC system to accelerate various applications.
  •  
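Entry 4's claim that many complicated functions share the CORDIC algorithm rests on the standard unified CORDIC iteration, in which a mode parameter m ∈ {1, 0, −1} selects circular, linear or hyperbolic coordinates and a rotation/vectoring flag selects which variable is driven to zero. The sketch below shows that textbook form only; the paper's reconfigurable Huicore datapath, scale-factor handling and repeated hyperbolic iterations are not modeled.

```python
import math

def unified_cordic(mode, rotation, x, y, z, n_iter=16):
    """Generalized CORDIC iteration (textbook form).

    mode:      1 -> circular, 0 -> linear, -1 -> hyperbolic
    rotation:  True  -> rotation mode (drive z toward 0)
               False -> vectoring mode (drive y toward 0, assumes x > 0)
    Scale-factor compensation and the repeated hyperbolic iterations
    are omitted to keep the sketch short."""
    def alpha(m, i):                        # elementary angle per mode
        if m == 1:
            return math.atan(2.0 ** -i)
        if m == 0:
            return 2.0 ** -i
        return math.atanh(2.0 ** -i)

    start = 1 if mode == -1 else 0          # hyperbolic CORDIC starts at i = 1
    for i in range(start, start + n_iter):
        d = (1.0 if z >= 0 else -1.0) if rotation else (1.0 if y < 0 else -1.0)
        x, y, z = (x - mode * d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * alpha(mode, i))
    return x, y, z

if __name__ == "__main__":
    # vectoring-mode circular CORDIC: z converges to atan(y0 / x0),
    # unaffected by the omitted scale factor
    _, _, angle = unified_cordic(1, False, 1.0, 0.5, 0.0)
    print(angle, math.atan(0.5))
```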
5.
  • Chen, Hui, et al. (author)
  • Hyperbolic CORDIC-Based Architecture for Computing Logarithm and Its Implementation
  • 2020
  • In: IEEE Transactions on Circuits and Systems II: Express Briefs. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-7747 .- 1558-3791. ; 67:11, pp. 2652-2656
  • Journal article (peer-reviewed), abstract:
    • We present a CORDIC (coordinate rotation digital computer)-based method to compute the base-2 logarithm and validate it by software simulation and hardware implementation. Technically, we overcome the limitation of traditional hyperbolic CORDIC and transform it, based on the idea of generalized hyperbolic CORDIC, so that it can be used to compute log₂x for x ∈ [1, 2). The proposed method requires only simple shift-and-add operations and offers a good trade-off between precision (or speed) and area. In MATLAB, we provide different precisions corresponding to the number of iterations of the transformed CORDIC to suit user needs. Using a pipelined structure and setting the number of iterations to 16 (the average relative error is 2.09×10⁻⁶), we implement an example hardware circuit. Synthesized under SMIC 65 nm CMOS technology, the circuit has an area of 24100 μm² and a computation time of 11.1 ns, which on average saves 31.04% area and improves computation speed by 6.92% compared with existing methods.
  •  
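Entry 5 computes log₂x on [1, 2) with a transformed hyperbolic CORDIC. A closely related, well-known route is sketched below: vectoring-mode hyperbolic CORDIC yields atanh((x−1)/(x+1)), the identity ln x = 2·atanh((x−1)/(x+1)) gives the natural logarithm, and a change of base gives log₂. This illustrates the identity such designs exploit; it is not the paper's transformed recurrence or its fixed-point pipeline.

```python
import math

def log2_hyperbolic_cordic(x, n_iter=16):
    """log2(x) for x in [1, 2), using
        ln(x) = 2 * atanh((x - 1) / (x + 1)),
    with atanh obtained from vectoring-mode hyperbolic CORDIC
    (shift-add iterations only; indices 4 and 13 are repeated for
    convergence)."""
    assert 1.0 <= x < 2.0
    schedule, i = [], 1
    while len(schedule) < n_iter:
        schedule.append(i)
        if i in (4, 13):
            schedule.append(i)
        i += 1
    schedule = schedule[:n_iter]

    xi, yi, zi = x + 1.0, x - 1.0, 0.0
    for i in schedule:
        d = 1.0 if yi < 0 else -1.0          # drive y toward zero (x > 0 here)
        xi, yi = xi + d * yi * 2.0 ** -i, yi + d * xi * 2.0 ** -i
        zi -= d * math.atanh(2.0 ** -i)
    return 2.0 * zi / math.log(2.0)          # ln(x) = 2 * atanh(...), then change base

if __name__ == "__main__":
    for v in (1.0, 1.25, 1.5, 1.999):
        print(v, log2_hyperbolic_cordic(v), math.log2(v))
```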
6.
  • Chen, Hui, et al. (author)
  • Low-Complexity High-Precision Method and Architecture for Computing the Logarithm of Complex Numbers
  • 2021
  • In: IEEE Transactions on Circuits and Systems I: Regular Papers. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-8328 .- 1558-0806. ; 68:8, pp. 3293-3304
  • Journal article (peer-reviewed), abstract:
    • This paper proposes a low-complexity method and architecture to compute the logarithm of complex numbers based on the coordinate rotation digital computer (CORDIC). Our method exploits the vector mode of circular CORDIC and hyperbolic CORDIC, so its hardware implementation needs only shift-add operations. Our architecture has lower design complexity and higher performance than conventional architectures. Through software simulation, we show that the method achieves high precision for logarithm computation, reaching a relative error of 10⁻⁷. Finally, we design and implement an example circuit under TSMC 28 nm CMOS technology. According to the synthesis report, our architecture has a smaller area, lower power consumption, higher precision and a wider operating range than the alternative architectures.
  •  
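Entry 6 builds the complex logarithm from two CORDIC passes, since ln z = ln|z| + j·arg(z): vectoring-mode circular CORDIC supplies the modulus and phase angle, and hyperbolic CORDIC supplies the real logarithm. The sketch below shows that decomposition with library calls standing in for the CORDIC blocks.

```python
import cmath
import math

def complex_log(z: complex) -> complex:
    """ln(z) = ln(|z|) + j*arg(z), the decomposition used in entry 6.
    In hardware the modulus and angle come from vectoring-mode circular
    CORDIC and ln(|z|) from hyperbolic CORDIC; library calls stand in
    for both here."""
    if z == 0:
        raise ValueError("logarithm of zero is undefined")
    magnitude = abs(z)                       # |z|    (circular CORDIC, vectoring)
    angle = math.atan2(z.imag, z.real)       # arg(z), same vectoring pass
    return complex(math.log(magnitude), angle)

if __name__ == "__main__":
    z = complex(3.0, 4.0)
    print(complex_log(z), cmath.log(z))      # the two results should agree
```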
7.
  • Chen, Hui, et al. (author)
  • Symmetric-Mapping LUT-Based Method and Architecture for Computing X^Y-Like Functions
  • 2021
  • In: IEEE Transactions on Circuits and Systems I: Regular Papers. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-8328 .- 1558-0806. ; 68:3, pp. 1231-1244
  • Journal article (peer-reviewed), abstract:
    • We propose a new method and hardware architecture to compute functions of the form X^Y (where X and Y are arbitrary floating-point numbers), which supports arbitrary Nth root, exponential and power operations. Because direct computation is complex, X^Y is usually converted to logarithm, multiplication and antilogarithm operations. Traditional approaches suffer from long latency, large area and high power consumption. To solve this problem, we propose a symmetric-mapping lookup table (SM-LUT) capable of computing log₂x (x ∈ [1, 2]) and 2^x (x ∈ [0, 1]) simultaneously, which lays the foundation for computing X^Y. To further improve the hardware performance of our architecture, we propose a multi-region address searcher to speed up SM-LUT lookups. In addition, we use an optimized Vedic multiplier to shorten the critical path and improve the efficiency of the multiplication involved in computing X^Y. Under TSMC 40 nm CMOS technology, we design and synthesize a reference circuit that computes X^Y with a maximum relative error of 10⁻³. The report shows that the reference circuit has an area of 14338.50 μm² and a power consumption of 4.59 mW at a frequency of 1 GHz. Compared with state-of-the-art work under the same input range and similar precision, it saves on average 78.57% area and 80.42% power consumption for Nth-root computation and 82.89% area and 81.89% power consumption for R^N computation. On top of that, our architecture reduces computation latency by 62.77% on average and is an order of magnitude more energy-efficient than the others.
  •  
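Entry 7 evaluates X^Y through the identity X^Y = 2^(Y·log₂X), with a single symmetric-mapping LUT providing both log₂x on [1, 2] and 2^x on [0, 1]. The sketch below follows the same log → multiply → antilog flow; the 256-entry tables and linear interpolation are illustrative stand-ins for the SM-LUT and multi-region address searcher, not the paper's design, and only positive X is handled.

```python
import math

# Tiny interpolated tables standing in for the symmetric-mapping LUT:
# log2(m) for m in [1, 2] and 2**f for f in [0, 1].  Table size and
# linear interpolation are illustrative assumptions.
_N = 256
_LOG2 = [math.log2(1.0 + i / _N) for i in range(_N + 1)]
_EXP2 = [2.0 ** (i / _N) for i in range(_N + 1)]

def _interp(table, u):
    """Linear interpolation of a table defined over [0, 1] at position u."""
    pos = min(max(u, 0.0), 1.0) * _N
    k = min(int(pos), _N - 1)
    t = pos - k
    return table[k] + (table[k + 1] - table[k]) * t

def pow_xy(x, y):
    """x**y for x > 0 via the log -> multiply -> antilog flow:
    x**y = 2**(y * log2(x))."""
    assert x > 0.0
    m, e = math.frexp(x)              # x = m * 2**e with m in [0.5, 1)
    m, e = 2.0 * m, e - 1             # renormalize so m is in [1, 2)
    log2_x = e + _interp(_LOG2, m - 1.0)
    p = y * log2_x
    pf = math.floor(p)
    return (2.0 ** pf) * _interp(_EXP2, p - pf)

if __name__ == "__main__":
    for x, y in ((9.0, 0.5), (2.0, 10.0), (5.0, 1.0 / 3.0)):
        print(x, y, pow_xy(x, y), x ** y)
```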
8.
  • Chen, Qinyu, et al. (author)
  • An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective
  • 2020
  • In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems. - : Institute of Electrical and Electronics Engineers (IEEE). - 1063-8210 .- 1557-9999. ; 28:6, pp. 1540-1544
  • Journal article (peer-reviewed), abstract:
    • Convolutional neural networks (CNNs) have emerged as one of the most popular approaches in many fields. These networks deliver better performance as they grow deeper and larger; however, the complicated computation and huge storage requirements impede hardware implementation. To address this problem, quantized networks have been proposed. In addition, various convolutional structures are designed to meet the requirements of different applications. For example, compared with the traditional convolutions (CONVs) used for image classification, CONVs for image generation usually combine traditional CONVs, dilated CONVs and transposed CONVs, leading to a difficult hardware mapping problem. In this brief, we translate the difficult mapping problem into a sparsity problem and propose an efficient hardware architecture for sparse binary and ternary CNNs by exploiting their sparsity and low bit-width characteristics. To this end, we propose an ineffectual data removing (IDR) mechanism to remove both regular and irregular sparsity based on dual-channel processing elements (PEs). In addition, a flexible layered load balance (LLB) mechanism is introduced to alleviate load imbalance. The accelerator is implemented in 65 nm technology with a core size of 2.56 mm². It achieves 3.72 TOPS/W energy efficiency at 50.1 mW, which makes it a promising design for embedded devices.
  •  
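Entry 8 removes ineffectual operands (zero activations or zero weights) before they reach the MAC units of the sparse binary/ternary accelerator. The sketch below shows the arithmetic effect of that filtering on a single ternary dot product; the dual-channel PEs and layered load balancing that make it efficient in hardware are not modeled.

```python
def ternary_sparse_dot(activations, weights):
    """Dot product with ternary weights in {-1, 0, +1}, skipping
    ineffectual pairs (zero activation or zero weight) in the spirit of
    the ineffectual-data-removing step described in entry 8."""
    assert len(activations) == len(weights)
    acc = 0
    for a, w in zip(activations, weights):
        if a == 0 or w == 0:        # ineffectual pair: no MAC issued
            continue
        acc += a if w > 0 else -a   # ternary weight: add or subtract only
    return acc

if __name__ == "__main__":
    acts = [3, 0, 5, 7, 0, 2]
    wts  = [1, -1, 0, -1, 1, 1]
    print(ternary_sparse_dot(acts, wts))   # 3 - 7 + 2 = -2
```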
9.
  • Chen, Qinyu, et al. (author)
  • Enabling Energy-Efficient Inference for Self-Attention Mechanisms in Neural Networks
  • 2022
  • In: 2022 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS 2022). - : Institute of Electrical and Electronics Engineers (IEEE). ; pp. 25-28
  • Conference paper (peer-reviewed), abstract:
    • The study of specialized accelerators tailored for neural networks has become a promising topic in recent years. Existing neural network accelerators are usually designed for convolutional neural networks (CNNs) or recurrent neural networks (RNNs); however, less attention has been paid to attention mechanisms, an emerging neural network primitive with the ability to identify the relations within input entities. Self-attention-oriented models such as the Transformer have achieved great performance in natural language processing, computer vision and machine translation. However, the self-attention mechanism has an intrinsically expensive computational workload, which grows quadratically with the number of input entities. Therefore, in this work, we propose a software-hardware co-design solution for energy-efficient self-attention inference. A prediction-based approximate self-attention mechanism is introduced to substantially reduce the runtime as well as the power consumption, and a specialized hardware architecture is then designed to further increase the speedup. The design is implemented on a Xilinx XC7Z035 FPGA, and the results show that energy efficiency is improved by 5.7x with less than 1% accuracy loss.
  •  
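Entry 9 cuts the quadratic self-attention workload by predicting which scores matter and computing exact attention only for those. The sketch below mimics that idea with a deliberately simple predictor (coarsely quantized Q·Kᵀ) and a fixed keep ratio; both choices are assumptions for illustration, not the paper's prediction mechanism or its FPGA datapath.

```python
import numpy as np

def approx_self_attention(Q, K, V, keep=0.25):
    """Self-attention with a prediction-based pruning sketch: a cheap
    low-precision estimate of the score matrix selects, per query, the
    top fraction `keep` of keys; exact scores and softmax are computed
    only for the kept keys."""
    d = Q.shape[-1]
    q8 = np.round(Q * 8) / 8          # cheap predictor: coarse quantization
    k8 = np.round(K * 8) / 8
    pred = q8 @ k8.T
    k_keep = max(1, int(keep * K.shape[0]))
    out = np.zeros((Q.shape[0], V.shape[1]))
    for i in range(Q.shape[0]):
        idx = np.argpartition(pred[i], -k_keep)[-k_keep:]   # predicted top keys
        scores = (Q[i] @ K[idx].T) / np.sqrt(d)             # exact scores, kept keys only
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ V[idx]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((4, 8))
    K = rng.standard_normal((16, 8))
    V = rng.standard_normal((16, 8))
    print(approx_self_attention(Q, K, V).shape)   # (4, 8)
```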
10.
  • Chen, Yizhi, 1995-, et al. (author)
  • Accelerating Non-Negative Matrix Factorization on Embedded FPGA with Hybrid Logarithmic Dot-Product Approximation
  • 2022
  • In: Proceedings. - : Institute of Electrical and Electronics Engineers (IEEE). ; pp. 239-246
  • Conference paper (peer-reviewed), abstract:
    • Non-negative matrix factorization (NMF) is an effective method for dimensionality reduction and sparse decomposition. It has been of great interest to the scientific community in applications including signal processing, data mining, compression and pattern recognition. However, NMF implies elevated computational costs in terms of performance and energy consumption, which is inadequate for embedded applications. To overcome this limitation, we implement the vector dot product with a hybrid logarithmic approximation as a hardware optimization approach. This technique accelerates floating-point computation, reduces energy consumption and preserves accuracy. To demonstrate our approach, we employ a design-exploration flow using high-level synthesis on an embedded FPGA. Compared with software solutions on an ARM CPU, this hardware implementation accelerates the overall matrix decomposition by 5.597× and reduces energy consumption by 69.323×. Log-approximation NMF combined with KNN (k-nearest neighbors) shows only a 2.38% accuracy decrease on MNIST compared with KNN processing the matrix after floating-point NMF. Furthermore, compared with a dedicated floating-point accelerator, the logarithmic approximation approach achieves 3.718× acceleration and 8.345× energy reduction. Compared with a fixed-point approach, our approach shows an accuracy degradation of 1.93% on MNIST and an accuracy improvement of 28.2% on the Fashion-MNIST data set without prior knowledge of the data range. Thus, our approach has better compatibility with the input data range.
  •  
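Entry 10 replaces exact floating-point multiplications in the NMF dot products with a hybrid logarithmic approximation. The classic building block behind such schemes is Mitchell's approximation, log₂(1+f) ≈ f, which turns a multiplication into an addition of approximate logarithms. The sketch below shows that base technique on a dot product of non-negative values; the paper's hybrid correction and HLS datapath are not reproduced, and the roughly 11% worst-case error of plain Mitchell multiplication is exactly what such corrections are meant to reduce.

```python
import math

def _mitchell_log2(x):
    """Piecewise-linear log2 (Mitchell): for x = m * 2**e with m in [1, 2),
    log2(x) ~ e + (m - 1)."""
    assert x > 0.0
    m, e = math.frexp(x)            # m in [0.5, 1)
    return (e - 1) + (2.0 * m - 1.0)

def _mitchell_exp2(p):
    """Piecewise-linear inverse: 2**p ~ 2**floor(p) * (1 + frac(p))."""
    pf = math.floor(p)
    return (2.0 ** pf) * (1.0 + (p - pf))

def approx_mul(a, b):
    """Mitchell-style logarithmic multiplication of two non-negative
    numbers: a * b ~ antilog(log(a) + log(b))."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return _mitchell_exp2(_mitchell_log2(a) + _mitchell_log2(b))

def approx_dot(u, v):
    """Dot product built from the approximate multiplier; NMF updates
    are dominated by such dot products."""
    return sum(approx_mul(a, b) for a, b in zip(u, v))

if __name__ == "__main__":
    u = [0.5, 1.25, 3.0]
    v = [2.0, 0.8, 1.5]
    print(approx_dot(u, v), sum(a * b for a, b in zip(u, v)))
```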